Recall that HMC simulates the frictionless flow of a particle on a surface. In any given transition, which is just a single flick of the particle, the total energy at the start should be equal to the total energy at the end. That’s how energy in a closed system works. And in a purely mathematical system, the energy is always conserved correctly. It’s just a fact about the physics.
But in a numerical system, it might not be. Sometimes the total energy at the end is not the same as it was at the start. In these cases, the transition is called divergent. How can this happen? It tends to happen when the posterior distribution is very steep in some region of parameter space. Steep changes in probability are hard for a discrete physics simulation to follow. When that happens, the algorithm notices by comparing the energy at the start to the energy at the end. When they don’t match, it indicates numerical problems exploring that part of the posterior distribution.
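To see what the sampler is checking, here is a toy leapfrog integrator (an illustrative sketch, not Stan’s implementation) run on a funnel-shaped density. The gap between the starting and ending Hamiltonian is the numerical energy error; when it is large, the transition gets flagged as divergent.

```r
# toy example: negative log density of a funnel,
# nu ~ Normal(0, 3) and x ~ Normal(0, exp(nu))
U <- function(q) {
  -dnorm(q[1], 0, 3, log = TRUE) - dnorm(q[2], 0, exp(q[1]), log = TRUE)
}

# numerical gradient of the potential energy, for brevity
grad_U <- function(q, eps = 1e-6) {
  sapply(seq_along(q), function(i) {
    e <- numeric(length(q)); e[i] <- eps
    (U(q + e) - U(q - e)) / (2 * eps)
  })
}

# standard leapfrog steps: half-step momentum, full-step position, half-step momentum
leapfrog <- function(q, p, step, n_steps) {
  for (s in seq_len(n_steps)) {
    p <- p - step / 2 * grad_U(q)
    q <- q + step * p
    p <- p - step / 2 * grad_U(q)
  }
  list(q = q, p = p)
}

set.seed(1)
q0 <- c(-2, 0.1)            # start in the funnel's narrow neck
p0 <- rnorm(2)              # random momentum "flick"
H0 <- U(q0) + sum(p0^2) / 2 # total energy at the start
out <- leapfrog(q0, p0, step = 0.2, n_steps = 20)
H1 <- U(out$q) + sum(out$p^2) / 2 # total energy at the end
H1 - H0 # the energy error; a large gap is what gets flagged as divergent
```

In exact mathematics `H1 - H0` would be zero; any gap here is purely the discretization error that Stan monitors.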
centered parameterization
In his lecture, McElreath uses CENTERED PARAMETERIZATION to demonstrate divergent transitions. A very simple example:
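The centered form of the model, matching the code below, gives \(x\) a prior whose scale depends directly on \(\nu\):

\[\begin{align*}
x &\sim \text{Normal}(0, \text{exp}(\nu)) \\
\nu &\sim \text{Normal}(0, 3) \\
\end{align*}\]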
This parameterization is centered because the prior for \(x\) is defined directly in terms of the other parameter, \(\nu\). It’s intuitive, but it can cause a lot of problems for Stan, which is probably why McElreath used it for his example. In short, when there is limited data within our groups or the population variance is small, the parameters \(x\) and \(\nu\) become highly correlated. This geometry is challenging for MCMC to sample. (Think of a long and narrow groove, not a bowl, for your Hamiltonian skateboard.)
Code
set.seed(1)

# plot the likelihoods
ps <- seq(from = -4, to = 4, length.out = 200) # possible parameter values for both x and nu

crossing(nu = ps, x = ps) %>% # every possible combination of nu and x
  mutate(
    likelihood_nu = dnorm(nu, 0, 3),
    likelihood_x = dnorm(x, 0, exp(nu)),
    joint_likelihood = likelihood_nu * likelihood_x
  ) %>%
  ggplot(aes(x = x, y = nu, fill = joint_likelihood)) +
  geom_raster() +
  scale_fill_viridis_c() +
  guides(fill = F)
The way to fix this is to use a non-centered parameterization, which builds \(x\) from a standardized variable \(z\):
\[\begin{align*}
x &= z\times \text{exp}(\nu) \\
z &\sim \text{Normal}(0, 1) \\
\nu &\sim \text{Normal}(0, 3) \\
\end{align*}\]
Code
set.seed(1)

# plot the likelihoods
ps <- seq(from = -4, to = 4, length.out = 200) # possible parameter values for both z and nu

crossing(nu = ps, z = ps) %>% # every possible combination of nu and z
  mutate(
    likelihood_nu = dnorm(nu, 0, 3),
    likelihood_z = dnorm(z, 0, 1),
    joint_likelihood = likelihood_nu * likelihood_z
  ) %>%
  ggplot(aes(x = z, y = nu, fill = joint_likelihood)) +
  geom_raster() +
  scale_fill_viridis_c() +
  guides(fill = F)
It’s an important point, but the problems with centered parameterization are so prevalent1 that brms generally doesn’t allow it (with some exceptions). So we can’t recreate the divergent transitions that McElreath demonstrates in his lecture.
McElreath introduces the problem of fertility in Bangladesh; fitting a varying-intercepts model of contraceptive use by district gives:
Family: bernoulli
Links: mu = logit
Formula: use.contraception ~ 1 + (1 | district)
Data: d (Number of observations: 1934)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Multilevel Hyperparameters:
~district (Number of levels: 60)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.52 0.09 0.37 0.70 1.00 1374 1915
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -0.54 0.09 -0.72 -0.37 1.00 1998 2342
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
We could model the interaction between condition (presence/absence of another animal) and option (which side is prosocial), but it is more difficult to assign sensible priors to interaction effects. Another option, because we’re working with categorical variables, is to turn our 2x2 into one variable with 4 levels.
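As a minimal sketch of that recoding (the column names `prosoc_left` and `condition` follow `data(chimpanzees)`; the index coding itself is the standard one from the lecture):

```r
# sketch: collapse two binary indicators into one 4-level index variable
d <- data.frame(
  condition   = c(0, 0, 1, 1), # presence/absence of another animal
  prosoc_left = c(0, 1, 0, 1)  # which side is prosocial
)

# one level per condition/option combination: 1, 2, 3, 4
d$treatment <- 1 + d$prosoc_left + 2 * d$condition
d$treatment
```

Each level of `treatment` now gets its own parameter, so we can assign the same sensible prior to all four instead of wrestling with an interaction term.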
In this experiment, each pull is within a cluster of pulls belonging to an individual chimpanzee. But each pull is also within an experimental block, which represents a collection of observations that happened on the same day. So each observed pull belongs to both an actor (1 to 7) and a block (1 to 6). There may be unique intercepts for each actor as well as for each block.
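A minimal sketch of such a cross-classified model in brms (the data object `d` and the specific priors are assumptions here, not the lecture’s exact code):

```r
library(brms)

# sketch: varying intercepts for both clustering variables at once,
# assuming a data frame `d` with pulled_left, treatment, actor, and block
m_cross <- brm(
  data = d, family = bernoulli,
  pulled_left ~ 1 + treatment + (1 | actor) + (1 | block),
  prior = c(
    prior(normal(0, 1.5), class = Intercept),
    prior(normal(0, 0.5), class = b),
    prior(exponential(1), class = sd) # applies to both clustering variables
  ),
  iter = 2000, warmup = 1000, chains = 4, cores = 4
)
```

Note that `actor` and `block` are crossed, not nested: every actor appears in every block, and each grouping term gets its own standard deviation.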
as_draws_df(m3) %>%
  select(starts_with("sd")) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value, fill = name)) +
  geom_density(linewidth = 0, alpha = 3/4, adjust = 2/3, show.legend = F) +
  annotate(geom = "text", x = 0.67, y = 2, label = "block", color = "#5e8485") +
  annotate(geom = "text", x = 2.725, y = 0.5, label = "actor", color = "#0f393a") +
  scale_fill_manual(values = c("#0f393a", "#5e8485")) +
  scale_y_continuous(NULL, breaks = NULL) +
  ggtitle(expression(sigma["group"])) +
  coord_cartesian(xlim = c(0, 4))
exercise
Return to the data(Trolley) from an earlier lecture. Define and fit a varying intercepts model for these data, with responses clustered within participants. Include action, intention, and contact. Compare the varying-intercepts model and the model that ignores individuals using some method of cross-validation.
solution
data(Trolley, package = "rethinking")

# fit model without varying intercepts
m_simple <- brm(
  data = Trolley,
  family = cumulative,
  response ~ 1 + action + intention + contact,
  prior = c(prior(normal(0, 1.5), class = Intercept)),
  iter = 2000, warmup = 1000, cores = 4, chains = 4,
  file = here("files/data/generated_data/m71.e1")
)

# fit model with varying intercepts
m_varying <- brm(
  data = Trolley,
  family = cumulative,
  response ~ 1 + action + intention + contact + (1 | id),
  prior = c(
    prior(normal(0, 1.5), class = Intercept),
    prior(normal(0, 0.5), class = b),
    prior(exponential(1), class = sd)
  ),
  iter = 2000, warmup = 1000, cores = 4, chains = 4,
  file = here("files/data/generated_data/m71.e2")
)
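To finish the exercise, the two models can be compared with PSIS-LOO cross-validation. A sketch, assuming both models above fit without problems:

```r
# attach PSIS-LOO estimates to each fitted model, then compare
m_simple  <- add_criterion(m_simple, "loo")
m_varying <- add_criterion(m_varying, "loo")

# models are listed best-first; elpd_diff and se_diff quantify the gap
loo_compare(m_simple, m_varying)
```

Because each participant answered many questions, we should expect the varying-intercepts model to come out well ahead.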
Posterior predictions in multilevel models are a bit more complicated than in single-level models, because a question arises: do we want predictions for the same clusters we observed, or predictions for new clusters?
In other words, do you want to know more about the chimps you collected data on, or new chimps? Let’s talk about both.
predictions for chimps in our sample
Recall that the function fitted() gives expected-value predictions.
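As a sketch (the model name `m3` and its column names are assumptions carried over from the model above), predictions for the chimps in our sample keep each actor’s estimated intercept:

```r
# hypothetical sketch: expected probabilities for the observed actors,
# assuming a fitted brms model `m3` with (1 | actor) + (1 | block) terms
nd <- unique(m3$data[, c("actor", "block", "treatment")])

# by default, fitted() includes the estimated varying intercepts,
# i.e. predictions for the very chimps we collected data on
f <- fitted(m3, newdata = nd, probs = c(0.055, 0.945))
head(f)
```

Predicting for new chimps instead requires telling brms to ignore or re-sample the group-level effects, which is the other case discussed above.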